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Abstract 

We present the group fused Lasso for detection of multiple change-points shared by a set of co- 
occurring one-dimensional signals. Change-points are detected by approximating the original signals 
with a constraint on the multidimensional total variation, leading to piecewise-constant approximations. 
Fast algorithms are proposed to solve the resulting optimization problems, either exactly or approxi- 
mately. Conditions are given for consistency of both algorithms as the number of signals increases, 
and empirical evidence is provided to support the results on simulated and array comparative genomic 
hybridization data. 

1 Introduction 

Finding the place (or time) where most or all of a set of one-dimensional signals (or profiles) jointly change 
in some specific way is an important question in several fields. A common situation is when we want to find 
change-points in a multidimensional signal, e.g., in audio and image processing LL2J> to detect intrusion in 
computer networks ||3]lll^ or in financial and economics time series analysis ||5l. Another important situation 
is when we are confronted with several 1 -dimensional signals which we believe share common change- 
points, e.g., genomic profiles of a set of patients. The latter application is increasingly important in biology 
and medicine, in particular for the detection of copy-number variation along the genome |6|, or the analysis 
of microarray and genetic linkage studies [7 |. The common thread in biological applications is the search 
for data patterns shared by a set of individuals, such as cancer patients, at precise places on the genome; in 
particular, sudden changes in measured values. As opposed to the segmentation of multidimensional signals 
such as speech, where the dimension is fixed and collecting more data means having longer profiles, the 
length of signals in genomic studies (i.e., the number of probes measured along the genome) is fixed for a 
given technology while the number of signals (i.e., the number of individuals) can increase when we collect 
data about more patients. From a statistical point of view, it is therefore of interest to develop methods that 
identify multiple change-points shared by several signals that can benefit from increasing the number of 
signals. 

There exists a vast literature on the change-point detection problem ||8]|9]|. Here we focus on compu- 
tationally efficient methods to segment a multidimensional signal by approximating it with a piecewise- 
constant one, using quadratic error criteria. It is well-known that, in this case, the optimal segmentation of 
a p-dimensional signal of length n into k segments can be obtained in 0{n'^pk) by dynamic programming 
|[T0l[miT2l . However, the quadratic complexity in n is prohibitive in applications such as genomics, where 
n can be in the order of 10^ to 10^ with current technology. An alternative to such global procedures, which 
estimate change-points as solutions of a global optimization problem, are fast local procedures such as bi- 
nary segmentation 1131 . which detect breakpoints by iteratively applying a method for single change-point 
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detection to the segments obtained after the previous change-point is detected. While such recursive meth- 
ods can be extremely fast, in the order of 0{nplog{k)) when the single change-point detector is 0{np), 
quality of segmentation is questionable when compared with global procedures 1 14] . 

For p = 1 (a single signal), an interesting alternative to these global and local procedures is to express 
the optimal segmentation as the solution of a convex optimization problem, using the (convex) total variation 
instead of the (non-convex) number of jumps to penalize a piecewise-constant function in order to approx- 
imate the original signal |[T5l [16.1 . The resulting piecewise-constant approximation of the signal, defined 
as the global minimum of the objective function, benefits from theoretical guaranties in terms of correctly 
detecting change-points lUTllTSl . and can be implemented efficiently in 0{nk) or 0{n\og{n)) |[T9l[T7ll20l . 

In this paper we propose an extension of total-variation based methods for single signals to the mul- 
tidimensional setting, in order to approximate a multidimensional signal with a piecewise-constant signal 
with multiple change-points. We define the approximation as the solution of a convex optimization problem 
which involves a quadratic approximation error penalized by the sum of the Euclidean norms of the mul- 
tidimensional increments of the function. The problem can be reformulated as a group Lasso fTT], which 
we show how to solve exactly and efficiently. Alternatively, we provide an approximate yet often compu- 
tationally faster solution to the problem using a group LARS procedure ||2TI . In the latter case, using the 
particular structure of the design matrix, we can find the first k change-points in 0{npk), thus extending the 
method of fTT] to the multidimensional setting. 

Unlike most previous theoretical investigations of change-point methods (e.g., lITTl [TSl ). we are not 
interested in the case where the dimension p is fixed and the length of the profiles n increases, but in the 
opposite situation where n is fixed and p increases. Indeed, this corresponds to the case in genomics where, 
for example, n would be the fixed number of probes used to measure a signal along the genome, and p 
the number of samples or patients analyzed. We want to design a method that benefits from increasing 
p in order to identify shared change-points, even though the signal-to-noise ratio may be very low within 
each signal. As a first step towards this question, we give conditions under which our method is able to 
consistently identify a single change-point as p increases. We also show by simulation that the method is 
able to correctly identify multiple change-points as p — > +oo, validating its relevance in practical settings. 

The paper is organized as follows. After fixing notation in Section |2l we present the group fused Lasso 
method in Section |3] We propose two efficient algorithms to solve it in Section |4j and discuss its theoretical 
properties in Section [5] Lastly, we provide an empirical evaluation of the method and a comparison with 
other methods in the study of copy number variations in cancer in Section |6] A preliminary version of this 
paper was pubhshed in li22J . 

2 Notation 

For any two integers u < v, we denote by [u, v] the interval {u,u + 1, . . . ,v}. For any u x v matrix M 
we note Mj j its (i, j)-th entry, and ||M|| = Z]j=i ^^ij Frobenius norm (or Euclidean norm in 

the case of vectors). For any subsets of indices A = (oi, . . . , a\A\) S [1, n]'"^' and B = . . . , b^B\) S 
we denote by Ma,b the \A\ x \B\ matrix with entries Ma^^bj for € [1, |^|] x [1, \B\]. For 
simplicity we will use • instead of [1, u] or [l,v], i.e., Ai^, is the i-th row of A and A,j is the j-th column 
of A. We note lu,v the u x v matrix of ones, and Ip the p x p identity matrix. 

3 Formulation 

We consider p real-valued profiles of length n, stored in an n x p matrix Y. The i-th profile Y, j = 
{Yi^i, . . . , Yn^i) is the i-th column of Y. We model each profile as a piecewise-constant signal corrupted 
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by noise, and assume that change-point locations tend to be shared across profiles. Our goal is to detect 
these shared change-points, and benefit from the possibly large number p of profiles to increase the statisti- 
cal power of change-point detection. 



3.1 Segmentation with a total variation penalty 

When p = 1 (a single profile), a popular method to find change-points in a signal is to approximate it by a 
piecewise-constant function using a quadratic error criterion, i.e., to solve 

n-1 

mill \\Y -U f subject to ^ S{Ui+i - Ui) < k , (1) 

1=1 

where 5 is the Dirac function, equal to if its argument is null, 1 otherwise. In other words, ([B expresses the 
best approximation of y by a piecewise-constant profile U with at most k jumps. It is well-known that ([T]) 
can be solved in 0{'n?k) by dynamic programming [lOlliilllll- Although very fast when n is of moderate 
size, the quadratic dependency in n renders it impractical in current computers when n reaches millions or 
more, which is often the case in many application such as segmentation of genomic profiles. 

An alternative to the combinatorial optimization problem ([T]) is to relax it to a convex optimization 
problem, by replacing the number of jumps by the convex total variation (TV) 1. 15] . i.e., to consider: 

1 ""^ 
min -||y-C/f + AV|C/i+i-C/i| . (2) 

i=l 

For a given A > 0, the solution U G M" of Q is again piecewise-constant. Recent work has shown that 
Q can be solved much more efficiently than ([T]): 1 19] proposed a fast coordinate descent-like method, II17I 
showed how to find the first k change-points iteratively in 0{nk), and f20l proposed a 0(nln(n)) method 
to find all change-points. Adding penalties proportional to the li or £2 norm of C/ to Q does not change 
the position of the change-points detected |[T6ll23]| . and the capacity of TV denoising to correctly identify 
change-points when n increases has been investigated in ifTTlfTSl . 

Here, we propose to generalize TV denoising to multiple profiles by considering the following convex 
optimization problem, for Y G M"^^: 

^ n— 1 

mill ||y-C/||2 + AV||[/i+i,. (3) 

1=1 

The second term in Q can be considered a multidimensional TV: it penalizes the sum of Euclidean norms 
of the increments of U, seen as a time-dependent multidimensional vector, and reduces to the classical 1- 
dimensional TV whenp = 1. Intuitively, when A increases, this penalty will enforce many increment vectors 
f^j+i,« — Ui^, to collapse to 0, just like the total variation in (jj)) in the case of 1-dimensional signals. This 
implies that the positions of non-zero increments will be the same for all profiles. As a result, the solution to 
^ provides an approximation of the profiles y by an n x p matrix of piecewise-constant profiles U which 
share change-points. 

While ([3]l is a natural multidimensional generalization of the classical TV denoising method Q, we 
more generally investigate the following variant: 

rrll2 I \Sr^\\Ui+l,» — Ui,\\ 
mm +A> '— '■ — , (4) 
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where are position-dependant weights which affect the penaUzation of the jump differently at 

different positions. While (01) boils down to ^ for uniform weights di = I, i = l,...,n — 1, we will see 
that the unweighted version suffers from boundary effects and that position-dependent schemes such as: 



ViE[l,n-l], di = J- -, (5) 

y i{n - i) 

are both theoretically and empirically better choices. 

To illustrate the grouping effect of the penalty in (HJl, Figure [T] compares the segmentation of three sim- 
ulated profiles obtained with and without enforced sharing of change-points across profiles. We simulated 
three piecewise-constant signals corrupted by independent additive Gaussian noise. All profiles have length 
500 and share the same 5 change-points, though with different amplitudes, at positions 38, 139, 268, 320 
and 397. On the left-hand side, we show the first 5 change-points captured by TV denoising with weights 
^ applied to each signal independently. On the right, we show the first 5 change-points captured by formu- 
lation dH). We see that the latter formulation finds the correct change-points, whereas treating each profile 
independently leads to errors. For example, the first two change-points have a small amplitude in the second 
profile and are therefore very difficult to detect from the profile only, while they are very apparent in the first 
and third profiles. 
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Figure 1 : First 5 change-points detected on three simulated profiles by TV denoising of each profile (left) 
and by joint TV denoising (right). 



3.2 Reformulation as a group Lasso problem 

It is well-known that the 1 -dimensional TV denoising problem Q can be reformulated as a Lasso regres- 
sion problem by an appropriate change of variable 1,17 J . We now show that our generalization Q can be 
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reformulated as a group Lasso regression problem, which will be convenient for theoretical analysis and 
implementation ll2ni . To this end, we make the change of variables (/3, 7) e x given by: 

7 = Ui,, , 

A,. = ^'^"^'*, — ^ fori = l,...,n- 1. 

di 

In other words dif^ij is the jump between the i-th and the (i + l)-th positions of the j-th profile. We 
immediately get an expression for U as a function of /3 and 7: 

i-l 

Ui^, = 7 + ^ djl3j^, for z = 2, . . . , n . 

i=i 

This can be rewritten in matrix form as 

U = ln,i7 + X(3 , (6) 

where X is the n x (n — 1) matrix with entries Xij = dj for i > j, and otherwise. Making this change 
of variable, we can re-express (HJl as follows: 

^ n—l 

min -||y-X/3-l„,i7f + AV||/3i,. ||. (7) 

For any /3 G the minimum in 7 is attained with 7 = li,n(^ — Xf})/n. Plugging this into (|7]l, we 

get that the matrix of jumps /3 is solution of 

^ n— 1 

min + A V|| /3i,.||, (8) 

(n — l)xp Z ^ — ^ 

where y and X are obtained from y and X by centering each column. 

Equation (H) is now a classical group Lasso regression problem ll2ll . with a specific design matrix X 
and groups of features corresponding to the rows of the matrix (3. The solution /3 of ([8]l is related to the 
solution U of our initial problem (011 by equation 



4 Implementation 

Although @ and (O are convex optimization problems that can in principle be solved by general-purpose 
solvers [24], we want to be able to work in dimensions that reach millions or more, making this compu- 
tationally difficult. In particular, the design matrix X in ([D is a non-sparse matrix of size n x (n — 1), 
and cannot even fit in a computer's memory when n is large. Moreover, we would ideally like to obtain 
solutions for various values of A, corresponding to various numbers of change-points, in order to be able to 
select the optimal number of change-points using statistical criteria. In the single profile case (p = 1), fast 
implementations in 0{nk) or O(nlnn) have been proposed 1 19l[T7ll20,l . However, none of these methods 
is applicable directly to the p > 1 setting since they all rely on specific properties of the p = 1 case, such as 
the fact that the solution is piecewise-affine in A and that the set of change-points is monotically decreasing 
with A. 

In this section we propose two algorithms to respectively exactly or approximately solve (01) efficiently. 
We adopt the algorithms suggested by 11211 to solve the group Lasso problem (H) and show how they can be 
implemented very efficiently in our case due to the particular structure of the regression problem. We have 
placed in Annex A several technical lemmas which show how to efficiently perform several operations with 
the given design matrix X that will be used repeatedly in the implementations proposed below. 
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4.1 Exact solution by block coordinate descent 

A first possibility to solve the group Lasso problem ^ is to follow a block coordinate descent approach, 
where each group is optimized in turn with all other groups fixed. It can be shown that this strategy converges 
to the global optimum, and is reported to be stable and efficient 1(211 l25l . As shown by [21], it amounts to 
iteratively applying the following equation to each block i = l,...,n — lin turn, until convergence: 

S,, (9) 

7i V \\St\\J + 

where 7j = || X,^i |p = i{n — i)(if/n and Si = X~^.- (Y — and where denotes the (n — 1) x p 

matrix equal to /3 except for the i-th row {3^1 = 0. The convergence of the procedure can be monitored by 
the Karush-Kuhn-Tucker (KKT) conditions: 

-Xl^{Y -Xp) + ^f^ = Q V/3,„./0, 

lift,. II (10) 

II - X'J^i [Y - XP) II < A Vft,. = . 

Since the number of blocks n can be very large and we expect only a fraction of non-zero blocks at the op- 
timum (corresponding to the change-points), we implemented this block coordinate descent with an active 
set strategy. In brief, a set of active groups A corresponding to non-zero groups is maintained, and the algo- 
rithm alternates between optimizing /3 over the active groups in A and updating A by adding or removing 
groups based on violation of the KKT conditions. The resulting pseudo-code is shown in Algorithm [T] The 
inner loop (lines 3-7) corresponds to the optimization of /3 on the current active groups, using iteratively 
block coordinate descent Q. After convergence, groups that have been shrunk to are removed from the 
active set (line 8), and the KKT conditions are checked outside of the active set (lines 9-10). If they are not 
fulfilled, the group that most violates the conditions is added to the active set (line 1 1), otherwise the current 
solution satisfies all KKT conditions and is therefore the global optimum (line 13). 

Although it is difficult to estimate the number of iterations needed to reach convergence for a certain 
level of precision, we note that by Lemma |5] (Annex A), computation of X^Y in line 1 can be done in 
0{np), and each group optimization iteration (lines 3-7) requires computing X^^X,,^ (line 5), done in 
0(1^1) (see Lemma[6]in Annex A), then computing Si (line 5) in 0{\A\p) and soft-thresholding (line 6) in 
0{p). The overall complexity of each group optimization iteration is therefore 0{\A\p). Since each group 
in A must typically be optimized several times, we expect complexity that is at least quadratic in |^| and 
linear in p for each optimization over an active set A (lines 3-7). To check optimality of a solution after 
optimization over an active set A, we need to compute X^ Xp (line 9) which takes 0{np) (see Lemma 
|7j Annex A). Although it is difficult to upper bound the number of iterations needed to optimize over A, 
this shows that a best-case complexity to find k change-points, if we correctly add groups one by one to 
the active set, would be 0{npk) to check k times the KKT conditions and find the next group to add, and 
0{pk^) in total if each optimization over an active set is in 0(p|^p). In Section [6l we provide some 
empirical results on the behavior of this block coordinate descent strategy. 

4.2 Group fused LARS implementation 

Since exactly solving the group Lasso with the method described in Section 14.11 can be computationally 
intensive, it may be of interest to find fast, approximate solutions to (H). We propose to implement a strategy 
based on the group LARS, proposed in [2T| as a good way to approximately find the regularization path of 
the group Lasso. More precisely, the group LARS approximates the solution path of dSjl with a piecewise- 
affine set of solutions and iteratively finds change-points. The resulting algorithm is presented here as 
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Algorithm 1 Block coordinate descent algorithm 



Require: centered data Y, regularization parameter A. 
1: Initialize ^ ^ 0, /3 = 0, C ^ X^Y. 
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Check KKT: S - 
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u *r- argmaXj^^ Si^, 
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if M > A2 then 




12 


Add a new group: A 
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else 




14 


return /3. 
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end if 




16 


end loop 





Algorithm |2l and is intended to approximately solve Change-points are added one by one (lines 4 
and 8), and for a given set of change-points the solution moves straight along a descent direction (line 6) 
with a given step (line 7) until a new change-point is added (line 8). We refer to [.21] for more details and 
justification for this algorithm. 

While the original group LARS method requires storage and manipulation of the design matrix [21 ], 
implausible for large n here, we can again benefit from the computational tricks provided in Annex A to 
efficiently run the fast group LARS method. Computing X^Y in line 1 can be done in 0{np) using Lemma 
in To compute the descent direction (line 6), we first compute w in 0(|>4^|p) using Lemma[8l then a in 0{np) 
using Lemma|7] To find the descent step (line 7), we need to solve n polynomial equations of degree 2, the 
coefficients of which are computed in 0{p), resulting in a 0{np) complexity. Overall the main loop for each 
new change-point (lines 2-10) takes 0{np) in computation and memory, resulting in 0{npk) complexity in 
time and 0{np) in memory to find the first k change-points. We provide in Section [6] empirical results that 
confirm this theoretical complexity. 

5 Theoretical analysis 

In this section, we study theoretically to what extent the estimator (HJl recovers correct change-points. The 
vast majority of existing theoretical results for offline segmentation and change-point detection consider the 
setting where p is fixed (usually p = 1), and n increases (e.g., Q). This typically corresponds to cases 
where we can sample a continuous signal with increasing density, and wish to locate more precisely the 
underlying change-points as the density increases. 

We propose a radically different analysis, motivated notably by applications in genomics. Here, the 
length of profiles n is fixed for a given technology, but the number of profiles p can increase when more 
samples or patients are collected. The property we would like to study is then, for a given change-point 
detection method, to what extent increasing p for fixed n allows us to locate more precisely the change- 
points. While this simply translates our intuition that increasing the number of profiles should increase the 
statistical power of change-point detection, and while this property was empirically observed in Q, we are 
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Algorithm 2 Group fused LARS algorithm 

Require: centered data Y, number of breakpoints k. 
1: Initialize ^ ^ 0, c ^ X^Y. 
2: for i = 1 to A; do 
3: ifi=ltheii 

4: First change-point : u <— argmin ^gj^ || Cj_, ||, »4 ^ {u}. 
5: end if ^ 

6: Descent direction: compute w ^ (^Xj_^X, jx^ , , then a = X^ Xy^w. 

7: Descent step: for each u G [l,n — 1] find if it exists the smallest positive solution of the 
second-order polynomial in a: 

II " l|2 II ||2 

II Cu,9 (^^U^9 II II Cv^9 ^(^V^9 II 5 

where v is any element of A. 
8: Next change-point: u ^ argmin || cj^, \\, A ^ AU {u}. 

9: Update c ^ c — a^a. 
10: end for 



not aware of previous theoretical results in this setting. In particular we are interested in the consistency 
of our method, in the sense that it should correctly detect the true change-points if enough samples are 
available. 

5.1 Consistent estimation of a single change-point 

As a first step towards the analysis of this "fixed n increasing p" setting, let us assume that the observed 
centered profiles Y are obtained by adding noise to a set of profiles with a single shared change-point 
between positions u and n + 1, for some u G [1, n — 1]. In other words, we assume that 

Y = XP* + W , 

where /3* is an (n — 1) x p matrix of zeros except for the u-th row /3* ,, and is a noise matrix whose entries 
are assumed to be independent and identically distributed with respect to a centered Gaussian distribution 
with variance cr^. In this section we study the probability that the first change-point found by our procedure 
is the correct one, when p increases. We therefore consider an infinite sequence of jumps ( j ) , and 

letting /3p = I X]f=i(/^u i)^' assume that jP = limp_j.oo exists and is finite. We first characterize the 
first selected change-point as p increases. 

Lemma 1. Assume, without loss of generality, that u > n/2, and let, for i G [1, n — 1], 

2 z(n-i) 2 , ^^d^dl jfjn-uf ifi<u, 

Gi = di a H 2~ ^ ^ 2^ ■\2 , . (11) 

n n \u[n — i) otherwise. 

When p +00, the first change-point selected by the group fused Lasso (0) is in argmax Gi with 

probability tending to 1. 

Proof of this result is given in Annex B. From it we easily deduce conditions under which the first 
change-point is correctly found with increasing probability as p increases. Let us first focus on the un- 
weighted group fused Lasso (O, corresponding to the setting = 1 for i = 1, . . . , n — 1. 
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Theorem 2. Let a = u/nbe the position of the change-point scaled in the interval [0, 1], and 



,, (l-a)^(a-^) 
a- i - ^ 

" 2 2n 



(12) 



If < (T^, the probability that the first change-point selected by the unweighted group fused Lasso (0 is 
the correct one tends to 1 as p ^ +oo. When o"^ > a^, it is not the correct one with probability tending to 
1. 

This theorem, the proof of which can be found in Annex C, deserves several comments. 

• To detect a change-point at position u = an, the noise level cj^ must not be larger than the critical 
value (j^ given by (fT2l ). hence the method is not consistent for all positions, (j^ decreases mono- 
tonically from a = 1/2 to 1, meaning that change-points near the boundary are more difficult to 
detect correctly than change-points near the center. The most difficult change-point is the last one 
(u = n — 1) which can only be detected consistently if cj^ is smaller than 



• For a given level of noise o"^, change-point detection is asymptotically correct for any a G [e, 1 — e], 
where e satisfies cr^ = cf\_^, i.e., 



This shows in particular that increasing the profile length n increases the relative interval (as a fraction 
of n) where change-points are correctly identified, and that we can get as close as we want to the 
boundary for n large enough. 

• When (T^ < (T^, the correct change-point is found consistently when p increases, showing the benefit 
of the accumulation of many profiles. 

Theorem |2] shows that the unweighted group fused Lasso (O suffers from boundary effects, since it may 
not correctly identify a single change-points near the boundary is the noise is too large. In fact, Lemma[T]tells 
us that if we miss the correct change-point position, it is because we estimate it more towards the middle 
of the interval (see proof of Theorem [2] for details). The larger the noise, the more biased the procedure 
is. We now show that this issue can be fixed when we consider the weighted group fused Lasso (01) with 
well-chosen weights. 

Theorem 3. The weighted group fused Lasso (0) with weights given by dJ]) correctly finds the first change- 
point at any position with probability tending to 1 as p ^ +oo. 

The proof of Theorem |3] is postponed to Annex D. It shows that the weighting scheme ^ cancels the 
effect of the noise and allows us to consistently estimate any change-point, independently of its position in 
the signal, as the number of signals increases. 



l-l/n 



^ + o(n-i). 
n 
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5.2 Consistent estimation of a single change-point with fluctuating position 

An interesting variant of the problem of detecting a change-point common to many profiles is that of de- 
tecting a change-point with similar location in many profiles, allowing fluctuations in the precise location 
of the change-point. This can be modeled by assuming that the profiles are random, and that the i-th profile 
has a single change-point of value /3j at position Ui, where Ui)^^^ ^ are independent and identically 
distributed according to a distribution P = ® Pu (i.e., we assume Pi independent from Ui). We denote 
= Epi^(3'^ and pi = Pu{U = i) for i G [l,n — 1]. Assuming that the support of Pu is [a,b] with 
l<a<b<n — 1, the following result extends Theorem |2]by showing that the first change-point discov- 
ered by the unweighted group fused Lasso is in the support of Pu under some condition on the noise level, 
while the weighted group fused Lasso correctly identifies a change-point in the support of Pu asymptotically 
without conditions on the noise. 

Theorem 4. 1. Let a = U /nbe the random position of the change-point on [0, 1] and am = a/n and 
OiM = b/n the position of the left and right boundaries of the support of Pu scaled to [0,1]. If 
1/2 € {oimi oim), then for any noise level a"^, the probability that the first change-point selected by 
the unweighted group fused Lasso ^ is in the support of Pu tends to 1 as p ^ +oo. If 1/2 < Um or 
au < 1/2, let 

'r 1 

(13) 

ifOLM < 2 • 

The probability that the first selected change-point is in the support of Pu tends to 1 when cr^ < d'p^. 
When o"^ > ^p^, it is outside of the support of Pu with probability tending to 1. 

2. The weighted group fused Lasso (0) with weights given by ^ finds the first change-point in the support 
of Pu with probability tending to 1 as p ^ +oo, independently of and of the support of Pu- 

This theorem, the proof of which is postponed to Annex E, illustrates the robustness of the method to 
fluctuations in the precise position of the change-point shared between profiles. Although this situation 
rarely occurs when we are considering classical multidimensional signals such as financial time series or 
video signals, it is likely to be the rule when we consider profiles coming from different biological samples, 
where for example we can expect frequent genomic alterations at the vicinity of important oncogenes or 
tumor suppressor genes. Although the theorem only gives a condition on the noise level to ensure that the 
selected change-point lies in the support of the distribution of change-point locations, a precise estimate of 
the location of the selected change-point as a function of Pu, which generalizes Lemma [T] is given in the 
proof. 

5.3 The case of multiple change-points 

While the theoretical results presented above focus on the detection of a single change-point, the real interest 
of the method is to estimate multiple change-points. The extension of Theorem |2] to this setting is, however, 
not straightforward and we postpone it for future efforts. We conjecture that the group fused Lasso estimator 
can, under certain conditions, consistently estimate multiple change-points. More precisely, in order to 
generalize the proof of Theorem [2l we must analyze the path of the vectors (cj^,), and check that, for some 
A in (l3]l or they reach their maximum norm precisely at the true change-points. The situation is more 
complicated than in the single change-point case since, in order to fulfill the KKT optimality conditions, 
the vectors (q ,) must hit a hypersphere at each correct change-point, and must remain strictly within the 
hypersphere between consecutive change-points. This can probably be ensured if the noise level is not too 
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high (like in the single change-point case), and if the positions corresponding to successive change-points 
on the hypersphere are far enough from each other, which could be ensured if two successive change-points 
are not too close to each other, and are in sufficiently different directions. Although the weighting scheme 
(ID ensures consistent estimation of the first change-point independently of the noise level, it may however 
not be sufficient to ensure consistent estimation of subsequent change-points. 

Although we propose no theoretical results besides these conjectures for the case of multiple change- 
points, we provide experimental results below that confirm that, when the noise is not too large, we can 
indeed correctly identify several change-points, with probability of success increasing to 1 as ^» increases. 

5.4 Estimating the number of change-points 

The number of change-points detected by the group fused Lasso in the multidimensional signal depends 
on the choice of A in (O and (01). In practice, we propose the following scheme in order to estimate a 
segmentation and the number of change-points. We try to select a A that over-segments the multidimensional 
signal, that is, finds more change-points that we would normally expect for the given type of signal or 
application. Then, on the set of k change-points found, we perform post-processing using a simple least- 
squares criteria. Briefly, for each given subset ofk'<k change-points, we approximate each signal between 
successive change-points with the mean value of the points in that interval; then, we calculate the total sum 
of squared errors (SSE) between the set of real signals and these piecewise-constant approximations to 
them. Though it may appear computationally intensive or even impossible to do this for all subsets of 
k' < k change-points, a dynamic programming strategy (e.g., |6|) means that the best subset of k' < k 
change-points can be calculated for all A;' € {1, . . . , A;} in 0(A;^). 

It then remains to choose the "best" k' G {1, . . . , k} using, for example, a model-selection strategy. The 
optimal SSE for k' + 1 (which we may call SSE{k' + 1) to ease notation), will be smaller than SSE{k') 
but at a certain point, adding a further change-point will have no physical reality, it only improves the SSE 
due to random noise. Here, we implemented a method proposed in Il26l l6l where we first normalize the SSE 
for k' = 1, . . . ,k into a score J{k') such that J(l) = k and J{k) = 1, in such a way the J{k') has an 
average slope of —1 between 1 and k; we then try to detect a kink in the curve by calculating the discrete 
second derivative of J{k'), and selecting the k' after which this second derivative no longer rises above a 
fixed threshold (typically 0.5). 

6 Experiments 

In this section we test the group fused Lasso on several simulated and real data sets. All experiments were 
run under Linux on a machine with two 4-core Intel Xeon 3. 16GHz processors and a total of 16Gb of RAM. 
We have implemented the group fused Lasso in MATLAB; the package GFLseg is available for downloacfl 

6.1 Speed trials 

In a first series of experiments, we tested the behavior of the group fused Lasso in terms of computational 
efficiency. We simulated multidimensional profiles with various lengths n between 2^ and 2^^, various 
dimensions p between 1 and 2^^, and various number of shared change-points k between 1 and 2^. In each 
case, we first ran the iterative weighted group fused LARS (Section 1431 ) to detect successive change-points, 
and recorded the corresponding A values. We then ran the exact group fused Lasso implementation by block 
coordinate descent (Section HT]) on the same A values. Figure |2] shows speed with respect to increasing one 
of p, n and k while keeping the other two variables fixed, for both implementations. The axes are log-log, 

'Available at |http : //cbio . ensmp . fr /GFLseg] 
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so the slope gives the exponent of the complexity (resp. n, p and k). For the weighted group fused LARS, 
linearity is clearest for k, whereas for n and p, the curves are initially sub-linear, then slightly super-linear for 
extremely large values of n and p. As these time trials reach out to the practical limits of current technology, 
we see that this is not critical - on average, even the longest trials here took less than 200 seconds. The 
weighted fused group Lasso results are perhaps more interesting, as it is harder to predict in advance the 
practical time performance of the algorithm. Surprisingly, when increasing n (p and k fixed) or increasing 
p (n and k fixed), the group fused Lasso eventually becomes as fast the iterative, deterministic group fused 
LARS. This suggests that at the limits of current technology, if k is small (say, less than 10), the potentially 
superior performance of the Lasso version (see later) may not even be punished by a slower run-time with 
respect to the LARS version. We suggest that this may be due to the Lasso optimization problem becoming 
relatively "easier" to solve when n or p increases, as we observed that the Lasso algorithm converged quickly 
to its final set of change-points. The main difference between the Lasso and LARS performance appears 
when the number of change-points increases: with respective empirical complexities cubic and linear in k, 
as predicted by the theoretical analysis. Lasso is already 1,000 times slower than LARS when we seek 100 
change-points. 




n p k 

Figure 2: Speed trials for group fused LARS (top row) and Lasso (bottom row). Left column: varying 
n, with fixed p = W and k = 10; center column: varying p, with fixed n = 1000 and k = 10; right column: 
varying k, with fixed n = 1000 and p = 10. Figure axes are log-log. Results are averaged over 100 trials. 

6.2 Accuracy for detection of a single change-point 

Next, we tested empirically the accuracy the group fused Lasso for detecting a single change-point. We 
first generated multidimensional profiles of dimension p, with a single jump of height 1 at a position u, for 
different values of p and u. We added to the signals an i.i.d. Gaussian noise with variance (t^ = 10.78, 
the critical value corresponding to a = 0.8 in Theorem [2l We ran 1000 trials for each value of u and p, 
and recorded how often the group fused Lasso with or without weights correctly identified the change-point. 
According to Theorem [2l we expect that, for the unweighted group fused Lasso, for 50 < ti < 80 there is 
convergence in accuracy to 1 when p increases, and for u > 80, convergence in accuracy to zero. This is 
indeed what is seen in Figure [3] (left panel), with u = 80 the limit case between the two different modes of 
convergence. The center panel of Figure [3] shows that when the default weights (|5]l are added, convergence 
in accuracy to 1 occurs across all u, as predicted by Theorem [3] In addition, the right-hand-side panel 
of Figure |3] shows results for the same trials except that change-point locations can vary uniformly in the 
interval u ±2. We see that, as predicted by Theorem |4l the accuracy of the weighted group fused Lasso 
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Figure 3: Single change-point accuracy for the group fused Lasso. Accuracy as a function of the number 
of profiles p when the change-point is placed in a variety of positions n = 50 to u = 90 (left and centre 
plots, resp. unweighted and weighted group fused Lasso), or: u = 50±2 to u = 90±2 (right plot, weighted 
with varying change-point location), for a signal of length 100. 

remains robust against fluctuations in the exact change-point location. 

6.3 Accuracy for detecting multiple change-points 

To investigate the potential for extending the results to the case of many shared change-points, we further 
simulated profiles of length n = 100 with a change-point at all of positions 10, 20, ... , 90. We consider 
dimensions p between 1 and 500. Jumps at each change-point of each profile were drawn from a Gaussian 
with mean and variance 1; we then added centered Gaussian noise with 0-2 G {0.05,0.2,1} to each 
position in each profile. For each value of p and a^, we ran one hundred trials of both implementations, with 
or without weights, and recorded the accuracy of each method, defined as the percentage of trials where the 
first 9 change-points detected by the method are exactly the 9 true change-points. Results are presented in 
Figure m (from left to right, resp. cj^ = 0.05, 0.2, 1). Clearly, the group fused Lasso outperforms the group 
fused LARS, and the weighted version of each algorithm outperforms the unweighted version. Although 
the group LARS is usually considered a reliable alternative to the exact group Lasso [21 J, this experiment 
shows that the exact optimization by block coordinate descent may be worth the computational burden if 
one is interested in accurate group selection. It also demonstrates that, as we conjectured in Section [531 the 
group fused Lasso can consistently estimate multiple change-points as the number of profiles increases. 

6.4 Application to gain and loss detection 

We now consider a possible application of our method for the detection of regions with frequent gains 
(positive values) and losses (negative values) among a set of DNA copy number profiles, measured by 
array comparative genomic hybridization (aCGH) technology 1271 . We propose a two-step strategy for 
this purpose: first, find an adequate joint segmentation of the signals; then, check the presence of gain or 
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Figure 4: Multiple change-point accuracy. Accuracy as a function of the number of profiles p wfien 
cliange-points are placed at tlie nine positions {10, 20, ... , 90} and the variance a"^ of the centered Gaussian 
noise is either 0.05 (left), 0.2 (center) and 1 (right). The profile length is 100. 

loss on each interval of the segmentation by summarizing each profile by its average value on the interval. 
Note that we do not assume that all profiles share exactly the same change-points, but merely see the joint 
segmentation as an adaptive way to reduce the dimension and remove noise from data. 

In practice, we used group fused LARS on each chromosome to identify a set of 100 candidate change- 
points, and selected a subset of them by post-processing as described in Section [S!4l Then, in each piecewise- 
constant interval between successive shared change-points, we calculate the mean of the positive segments 
(shown in green in Figures [Sja) and[6tc)) and the mean of the negative segments (shown in red). The larger 
the mean of the positive segments, the more likely we are to believe that a region harbors an important 
common gain; the reasoning is analogous for important common losses and the mean of the negative seg- 
ments. Obviously, many other statistical tests could be carried out to detect frequent gains and losses on 
each segment, once the joint segmentation is performed. 

We compare this method for detecting regions of gain and loss with the state-of-the-art H-HMM method 
ll27l . which has been shown to outperform several other methods in this setting. As ||27l have provided their 
algorithm online with several of their data sets tested in their article, we implemented our method and theirs 
(H-HMM) on their benchmark data sets. 

In the first data set in [27], the goal is to recover two regions - one amplified, one deleted, that are shared 
in 8 short profiles, though only 6 of the profiles exhibit each of the amplified or deleted regions. Performance 
is measured by area under ROC curve (AUC), following ll27l . Running H-HMM with the default parameters, 
we obtained an AUC (averaged over 10 trials) of 0.96 it .01, taking on average 60.20 seconds. The weighted 
group fused LARS, asked to select 100 breakpoints and followed by dynamic programming, took 0.06 
seconds and had an AUC of 0.97. Thus, the performance of both methods was similar, though weighted 
group fused LARS was around 1000 times faster. 

The second data set was a cohort of lung cancer cell lines originally published in G8ll29l . As in ETl . we 
concentrated on the 18 NSCLC adenocarcinoma (NA) cell lines. Figure[5]shows the score statistics obtained 
on Chromosome 8 when using either weighted group fused LARS or H-HMM. Weighted group fused LARS 
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first selected 100 candidate change-points per chromosome, then followed optimization of the number of 
change-points by dynamic programming, took in total 4.7 seconds and finally selected 260 change-points. 
In contrast, H-HMM took 38 minutes (100 iterations, as given in the code provided by the authors). The 
H-HMM scores should look like those shown in Figure 4 (top panel) of [27] ; the difference is either due to 
the stochastic nature of the algorithm or using a different number of iterations than given in the sample code 
by the authors. In any case, at the MYC locus (near 13 x 10^ bp), both methods strongly suggest a common 
gained region. However, the supposed advantage of H-HMM to very sparsely predict common gains and 
losses is not clear here; for example, it gives high common gain confidence to several fairly large genomic 
regions between 9 and 14 x 10^ bp. 




Figure 5: Joint scores for a set of 18 NSCLC adenocarcinoma cell lines. Ha) using weighted group fused 
LARS;|5];b) using H-HMM with the actual code provided by |[27l . 



6.5 Application to bladder tumor aCGH profiles 

We further considered a publicly available aCGH data set of 57 bladder tumor samples 130). Each aCGH 
profile gave the relative quantity of DNA for 2215 probes. We removed the probes corresponding to sex 
chromosomes because the sex mismatch between some patients and the reference made the computation of 
copy number less reliable, giving us a final list of 2143 probes. 

Results are shown in Figure [6] 97 change-points were selected by the weighted group fused LARS; this 
took 1.1 seconds (Figure [6lc)). The H-HMM method (Figure [6td)) took 13 minutes for 200 iterations (after 
100 iterations convergence had not occured). We used the comprehensive catalogue of common genomic 
alterations in bladder cancer provided in Table 2 in 1311 to validate the method and compare with H-HMM. 
Our method (Figure [6tc)) concurred with the known frequently-amplified chromosome arms 20q, 8q, 19q, 
Iq, 20p, 17q, 19p, 5p, 2p, lOp, 3q and 7p, and frequendy-lost 9p, 9q, lip, lOq, 13q, Bp, 17p, 18q, 2q, 
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Figure 6: Bladder cancer profiles. [6ta) shows one of the original 57 profiles and its associated smoothed 
version. Ob) shows the result of superimposing the smoothed versions of the 57 bladder tumor aCGH 
profiles obtained using weighted group fused LARS followed by dimension-selection. |6tc) shows the result 
of transforming the set of smoothed outputs into "scores" for amplification/deletion (see Section 16.41) and 
Hd) the corresponding output for the H-HMM method [27]. Vertical black lines indicate chromosome 
boundaries. 

5q, 18p, 14q and 16q. The only known commonly-lost region which showed unconvincing common loss 
here was 6q. As for the H-HMM method (Figure [Sfd)), it selects a small number of very small regions of 
gain and loss, which are difficult to verify with respect to the well-known frequently amplified arms in IIBTI . 
As is suggested, the method may therefore be useful for selecting the precise location of important genes. 
However, as can be seen in Figure [6ta)-(b), many, but not all, alterations are much larger than those found 
with H-HMM, and where for example there are clearly several localized gains and losses in cliromosome 
8, H-HMM finds nothing at all. Perhaps the complexity of rearrangements in cliromosome 8 is not easily 
taken into account by the H-HMM algorithm. Note finally that the weighted group fused LARS was more 
than 700 times faster than H-HMM. 
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7 Conclusion 

We have proposed a framework that extends total-variation based approximation to the multidimensional 
setting, and developed two algorithms to solve the resulting convex optimization problem either exactly 
or approximately. We have shown theoretically and empirically that the group fused Lasso can consis- 
tently estimate a single change-points, and observed experimentally that this property is likely to hold also 
when several change-points are present. In particular, we observed both theoretically and empirically that 
increasing the number of profiles is highly beneficial to detect approximatively shared change-points, an 
encouraging property for biological applications where the accumulation of data measured on cohorts of 
patients promises to help in the detection of common genomic alterations. 

Although we do not assume that all profiles have the same change-points, we estimate only shared 
change-points. In other words, we try to estimate the union of the change-points present in the profiles. 
This can be useful by itself, eg, for dimension reduction. If we wanted to detect change-points of individual 
profiles, we may either post-process the results of the group fused Lasso, or modify the formulation by, e.g., 
adding a TV penalty to each profile in addition to the group lasso penalty. Similarly, for some applications, 
we may want to add a £i norm to the group fused Lasso objective function in order to constrain some 
or all signals to be frequently null. Finally, from a computational point of view, we have proposed efficient 
algorithms to solve an optimization problem (HJl which is the proximal operator of more general optimization 
problems where a smooth convex functional of U is minimized with a constraint on the multidimensional 
TV penalty; this paves the way to the efficient minimization of such functionals using, e.g., accelerated 
gradient methods |[32l . 

Annex A: Computational lemmas 

In this Annex we collect a few results useful to carry out the fast implementations claimed in Section |4l 
Remember that the n x (n — 1) matrix X defined in Q is defined by Xij = dj for i > j, otherwise. 
Since the design matrix X of the group Lasso problem dSjl is obtained by centering each column of X to 
zero mean, its columns are given by: 



We first show how to compute efficiently X^ R for any matrix R: 
Lemma 5. For any R € M"^^, we can compute C = X"^ R in 0{np) operations and memory as follows: 




T 



Vi = 1, . . . , n — 1 



(14) 



1 



Compute the n x p matrix r of cumulative sums ri 



"^^=1 Rj» by the induction: 



• ri,, = Ri^, 

• For i = 2, . 



n, ri^, 



ri_i,, + Ri^, . 



2. 



For i = 1, . . . ,n 



1, compute Ci,, = di {irn,,/n — rj^,) . 
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Proof. Using (fT4l ) we obtain the i-th row of C = R, for i = 1, . . . , n — 1, as follows: 



j=l I 

□ 

Next, we show how to compute efficiently submatrices of the (n — 1) x (n — 1) matrix X. 

Lemma 6. For any two subsets of indices A = (ai, . . . , a[y!i|) <^'^d B = . . . , i?^ [1, n — 1], the 
matrix X^ j^X,^b can be computed in O ([^|{-B|) in time and memory with the formula: 



G X [1,\B\] , X.^aX.,b 



mm{ai,bj)[n-max{ai,bj)] 
= da.db^ —. (15) 



Proof. Let us denote V = XJ^^X^^b- For any € [1, |^|] x [1, \B\], denoting u = mm{ai,bj) and 
V = max(aj, bj), we easily get from ([141) an explicit formula for Vij, namely, 



u{n — v) 



dudy 



u( - -1] i- - l] +{v -u)- { - -1] + (n-v) 

\n / \n / n \n / nn. 



dudy - 



n 

□ 

The next lemma provides another useful computational trick to compute efficiently X^ XR for any 
matrix R: 

Lemma 7. For any R G we can compute C = X^ XR in 0{np) by 

1. Compute, for i = 1, . . . , n — 1, i?j , = diRi^,. 

2. Compute the 1 x p vector S = {^^Zi Z^- 

3. Compute the {n — 1) x p matrix T defined by Ti^, = Yl^Zl induction: 

• for i = n - 2, . . . ,1, Ti^, = Tj+i,, + Ri^,. 

4. Compute the (n — 1) x p matrix U defined by Ui^, = X^j=i (5' — 7j>) by the induction: 

. C/i,. = S- Ti,.. 

• fori = 2,...,n-l, Ui^, = C/j-i,, + S - Ti^,. 

5. Compute, for i = 1, . . . ,n — 1, Ci , = diUi , 
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Each step in Lemma |7] has complexity 0{np) in memory and time, leading to an overall complexity in 
0{np) to compute X"^ XR. We note that if R is row-sparse, i.e., is several rows of R are null, then the 
first two steps have complexity 0{sp), where s is the number of non-zero rows in R. Although this does 
not change the overall complexity to compute X^ XR, this leads to a significant speed-up in practice when 

s <C n. 

Proof. Let us denote D the (n — 1) x (n — 1) diagonal matrix with entries j = di. By Lemma[6l we 
know that X^ X = DVD, with Vij = min(i,j) [n — max(i,j)] /n, for 1 < i,j < n — 1. Since step 
1 computes R = DR and step 5 computes C = DU, we just need to show that the U computed in step 
4 satisfies U = VRto conclude that C = DVR = DVDR = X^XR. By step 4, U is defined by the 
relation U-i^, — Ui^i^, = S — Ti^, for i = 1, . . . , n — 1 (with the convention C/q,, = 0), therefore we just 
need to show that {Vi^, — R = S — Ti^, for i = 1, . . . , n — 1 to conclude. For 0<j<i<n — 1, 

we note that Vij = j{n — i)/n (with the convention Vo,« = 0) and Vi-ij = j{n — i + l)/n, and therefore 
Vij — Vi-ij = —j/n. For 1 < i < j < n — 1, we have Vij = i{n — j)/n and Vi-ij = {i — — j)/n 
and therefore Vij — Vi-ij = 1 — j/n. Combining these expressions we get, for i = 1, . . . , n — 1: 



n-l 



n— 1 



n 



+ Rj^, — S — Ti 



j=i 



where S and T are defined in steps 2 and 3. This concludes the proof that C = X^ XR. 



□ 





can be computed in 0{\A\p) in time and memory by 



1. For i = 1, . . . , |Aj — 1, compute 





2. Compute the successive rows of C according to: 




Ci,, = d-l (Ai_i - A,) for I = 2,..., \A\ - 1 



(16) 




Proof. Let us denote V = X^ j^X,^a- By Lemma[6]we know that, for 1 < i < j < |^ 



ai{n — Qj) 



n 
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V being symmetric semi-separable, one can easily check that V is invertible and admits as inverse a tridiag- 
onal matrix with the following entries |[33l : 



Vri' = d-'( + ) fori = l,...,|^|, 

d-^d-^ 

^M+i = ^,;U = --^^^^ forz = l,...,|^|-l, 

where by convention we define ao = and a|yi|_|_i = n. This tri-diagonal structure allows successive rows 
of C to be expressed as a sum of just a few terms. More precisely, for 1 < i < \A\, we obtain: 

d„^ ^d„^ Rj-^\ » r, / 1 1 \ d„^d„^^ , Ri4--i • 

L-i,. = h a Ki^, h 



Hi — a,i-i \ai — ai-i ai+i — aij Oj+i — a 



= d'l (Ai_i - A,) . 

Similarly, for i = 1 and i = \ A\we easily recover ( fT6] ). □ 



Annex B: Proof of Lemma [T] 



The solution of ([Hi is constant, i.e., corresponds to /3 = (no change-point), as long as the KKT conditions 
(flOl ) are satisfied for /3 = 0. This translates to j| Xj^Y || < A for all i. The first change-point occurs when 
A = maxj II Xj-Y ||, and the change-point is precisely located in the position i that reaches the maximum. 
Therefore the first change-point is the row with the largest Euclidean norm of the matrix: 

c = X^Y = X~^X(3* + X^W . 

The entries of the matrix c are therefore jointly Gaussian. Since only the n-th row /3„ , of /3 is non-zero, we 
get 

E{c) = X^XP* = X''X,,uPl. . 

Using Lemma[6]we compute: 

iorl <i<u, 



On the other hand, by (fT4l) we have for any i G [1, n — 1], 



didu^^^P*, foru<i<n-l. 



X^W 



di 



1,9 



j=i+i 



Since 



where 5ij is the Dirac function, we have for 1 < i < j < n — 1: 



E 



X^W 



X~^W 



didj 



i| --1 

n 



n I n \ n I nn 



i{n-j) 2j 

aiUj (T In. 

n 



(18) 



(19) 
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In summary, we have shown that c is jointly Gaussian with E (q^,) given by ([T8] ) and covariance between 

Cj,, and Cj^, given by ([T9l ). 

In particular, if we denote Fj = || Cj^, |p , then, for i < u, Fiu/ [dfi{n — follows a non-central 
distribution with p degrees of freedom and non-centrality parameter p^pd'^i{n — uf' j \n{n — i)<7^] . In 

particular, 

EFi = p/3p dfdi +Pdt- -(7^ , 

and since linip_^_(_oo 

/3p = we get that Fi/p converges in probability to 

= ^ = P'dUl'^^^^ + df^^^a^ . (20) 
A similar computation shows that for i > u, F^/p converges in probability to 

Gi = /S^d^^d^ — ^ ^ ^ + o!?^ . (21) 

Note that (|20l ) and (|2TI ) are equivalently defined in ([TT]) . Now, let 1/ = argmax Gj. For any v ^ V 

and J ^ y, the probability of the event Fy > Fj tends to 1, because Gy > Gj. By the union bound the 
probability of the event max^^y Fi < max^gy Fy also converges to 1, showing that the probabiUty to select 
a change-point in V converges to 1 as p — )• +oo. □ 



Annex C: Proof of Theorem |2] 

By Lemma [U we know that the first change-point selected by ^ is in argmax Gj with probability 
tending to 1 as p increases, where Gj is defined in (fTTI) . We will therefore asymptotically select the correct 
change-point u if and only if Gu = maxjg[i^_]^] Gj. Remember we assume, without lack of generality, that 
u > n/2. For n < i < n — 1, we observe that Gj given by (|2TI) is a decreasing function of i as a sum of 
two decreasing functions. Therefore, it always holds that Gu = maxjg Gj, and we just need to check 
whether or not G„ = maxjgji „] Gj holds. 

For i G [l,u\, Gi given by (l20l ) is a second-order polynomial of i, which is equal to at i = and 
strictly positive for i = u. Therefore Gu = maxjg[i^] Gj if and only if Gu > Gu-i- Let us therefore 
compute: 

f \2 2 

Gu - Gu^i = P^^^^^^P- [u^ -{u- if] + — [u{n - u) - (n - l)(n - ii + 1)] 



where a = u/n and 



I3'^{2u-l){n-uf a'^{n-2u + l) 

n 



_ ^52V^ ~ V" - 2^ 



(22) 



a' = n/3 





2 2n 



a 1 1 



This shows that, when a > 1/2 + l/(2n), G„ > Gu~i if and only if a < a. On the other hand, when 
a = 1/2 or 1/2 + l/(2n), we have always that Gu > Gu-i- □ 
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Annex D: Proof of Theorem |3] 



As for the proof of Theorem |2l we need to check whether or not Gu = maxjg[i „_i] d, where Gi is 
defined in ([TTI ). to deduce whether the method selects the correct change-point u or a different position with 
probability tending to 1 when p increases. Substituting weights d,; defined in Q into Gi, we obtain: 

a^^,2^^2^Uin-u)/u{n-^) if ^ < u , ^^3) 
\u{n — i)/i{n — u) otherwise. 

It is then easy to see that (l23l) is increasing on [1, n], and decreasing on [u, n — 1], showing that we always 
have argniax Gi — u. The result then follows from Lemma [Tl CH 



Annex E: Proof of Theorem 3] 

Following the proof of Lemma[Tl let us estimate -Fj = || Cj , |p for i € [1, n — 1]. For any j G we first 
observe by ([TSl l that 



X^Xl3 



'ddu.'^^^^^j if 
^ diduj otherwise . 



Therefore, 



2 rl'i P 



(n - UjY l{i < Uj) + (n - iy Ujl{i > Uj] 



and by independence of /3i and Ui'. 

p 



1 2 



"^Pudl 



n-l 



+ E Pudl{n-u) 



.u=l 



u=i+l 



Since (/3j, C/i)j=i,...,p are independent of the noise, we obtain that Fi/p converges in probability to 



(24) 



(25) 



r/2 



n-l 



^p„d^u2(n - i)2 + ^ pudl{n-u) 



+ di 



, i[n — I 



-a 



n 



(26) 



_u=l u=i+l 

As in Lemma [J we can conclude that the method will select the position 

u = argmax Gi 

uG[l,n-l] 

with probability tending to 1 as p increases. 

Let us now assume that the support of Pu is an interval [a, b] (corresponding to a possible range of 
fluctuation of a change-point). Then, we observe that for i < a, Gi in (l26l ) reduces to 



J2 



+ ^Pudl ( 



\2 -2 

n — u) I 



+ di- -a^ 



n 



(27) 



-p^'^E[du{n-U)f + df^''-'^-^ 



-a 



n 

Let us now consider the two possible weighting schemes. 
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In the unweighted case di = 1 for i = 1, ... . 



n 



1, we obtain from (|27] ) that for i < a: 



n 



(28) 



While the first term in (|28I) is strictly increasing on [0, a], the second term moves the maximum of Gi 
towards n/2. This shows that the maximum of d is always at least a when a < n/2. By symmetry, 
it is also always smaller or equal to b when b > n/2. When n/2 € [a, b], we deduce that for any 
(T^ > 0, u € [a, 6]. Otherwise, let us suppose without lack of generality that n/2 < a < b. Then, Gi 
being quadratic on [0, a] and equal to at 0, the maximum of Gi will not occur before a if and only if 
Ga-i < Ga- ^ computation similar to the one in the proof of Theorem |2] shows that 



Ga — Ga- 



2 (o-^ 



where 



Pi-2 



,E(1 



a 



a 



OLr 



1 1 

2 2^ 



OLr, 



2n 



This shows that Ga > Ga-i if and only if a"^ < a"^. Since b > n/2, we also know that u < b, i.e., 
u G [a, b] in that case. The case 1 < a < b < n/2 can be treated similarly. To conclude the proof it 
suffices to observe that 



E{1 - a)2 = (1 - Eaf + var(a) . 



In the weighted case di 



for i = 1, 



Gi=^^ 



-E 



n — 1, we obtain from (1261) and (1271) that for i < a: 

n-U' 



n 



U 



(29) 



This is always an increasing function of i on [1, a], showing that the maximum of Gi can not be strictly 
smaller than a. By symmetry, it can also never be larger than b, from which we conclude that it is 
always between a and b, i.e., in the support of Pu. 



□ 
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